{tools}[foss/2025b] PyTorch v2.9.1, parameterized v0.9.0, pytest-subtests v0.15.0, ... w/ CUDA 12.9.1 #24926
…tests-0.15.0-GCCcore-14.3.0.eb, PyTorch-2.9.1-foss-2025b-CUDA-12.9.1.eb, unittest-xml-reporting-3.2.0-GCCcore-14.3.0.eb and patches: PyTorch-1.12.1_add-hypothesis-suppression.patch, PyTorch-1.7.0_disable-dev-shm-test.patch, PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.6.0_show-test-duration.patch, PyTorch-2.6.0_skip-test_segfault.patch, PyTorch-2.7.0_avoid_caffe2_test_cpp_jit.patch, PyTorch-2.7.1_avoid-caffe2-sandcastle-test-lib.patch, PyTorch-2.7.1_skip-test_data_parallel_rnn.patch, PyTorch-2.7.1_skip-test_gds_fails_in_ci.patch, PyTorch-2.7.1_skip-test_mixed_mm_exhaustive_dtypes.patch, PyTorch-2.7.1_skip-tests-requiring-SM90.patch, PyTorch-2.7.1_suport-64bit-BARs.patch, PyTorch-2.7.1_tolerance-test_partial_flat_weights.patch, PyTorch-2.9.0_disable-test_nan_assert.patch, PyTorch-2.9.0_enable-symbolizer-in-test_workspace_allocation_error.patch, PyTorch-2.9.0_fix-attention-squeeze.patch, PyTorch-2.9.0_fix-FP16-CPU-tests-in-test_torchinductor_opinfo.patch, PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_fix-test_exclude_padding.patch, PyTorch-2.9.0_fix-test_version_error.patch, PyTorch-2.9.0_honor-XDG_CACHE_HOME.patch, PyTorch-2.9.0_increase-tolerance-in-test_transformers.patch, PyTorch-2.9.0_remove-faulty-close.patch, PyTorch-2.9.0_revert-pybind11-3-change.patch, PyTorch-2.9.0_skip-test_benchmark_on_non_zero_device.patch, PyTorch-2.9.0_skip-test_convolution1-on-H100.patch, PyTorch-2.9.0_skip-test_inductor_all_gather_into_tensor_coalesced.patch, PyTorch-2.9.0_skip-test_original_aten_preserved_pad_mm.patch, PyTorch-2.9.0_skip-test_override-without-CUDA.patch, PyTorch-2.9.0_skip-test_unbacked_reduction.patch, PyTorch-2.9.0_skip-tests-requiring-CUDA-12.8.patch, PyTorch-2.9.0_skip-unexpected-success-in-test_fake_export.patch, PyTorch-2.9.1_skip-RingFlexAttentionTest.patch
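As a reading aid for the patch list above: the file names appear to follow the pattern `<name>-<version>_<description>.patch`, where the version is presumably the PyTorch release the patch was first written for. The following is only a minimal sketch under that assumption. The helper `group_patches_by_version` and the regular expression are hypothetical illustrations and not part of EasyBuild; the sample names are copied from the list above.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <name>-<version>_<description>.patch
# (inferred from the file names in this PR, not an official specification).
PATCH_RE = re.compile(
    r"^(?P<name>[^-]+)-(?P<version>\d+(?:\.\d+)*)_(?P<desc>.+)\.patch$"
)


def group_patches_by_version(patch_names):
    """Group patch descriptions by the version embedded in the file name."""
    groups = defaultdict(list)
    for patch in patch_names:
        match = PATCH_RE.match(patch)
        if match is None:
            raise ValueError(f"unexpected patch file name: {patch}")
        groups[match.group("version")].append(match.group("desc"))
    return dict(groups)


def version_key(version):
    """Sort key that compares versions numerically, e.g. 1.12.1 after 1.7.0."""
    return tuple(int(part) for part in version.split("."))


# A few names taken from the patch list in this PR.
patches = [
    "PyTorch-1.12.1_add-hypothesis-suppression.patch",
    "PyTorch-1.7.0_disable-dev-shm-test.patch",
    "PyTorch-2.7.1_skip-tests-requiring-SM90.patch",
    "PyTorch-2.9.0_fix-nccl-test-env.patch",
    "PyTorch-2.9.0_honor-XDG_CACHE_HOME.patch",
    "PyTorch-2.9.1_skip-RingFlexAttentionTest.patch",
]

by_version = group_patches_by_version(patches)
for version in sorted(by_version, key=version_key):
    print(version, by_version[version])
```

Grouping this way makes it easier to see which patches are carried over from older PyTorch easyconfigs and which are new for 2.9.x.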
Updated software
|
Are you using the latest easyblock? It is missing this commit from easybuilders/easybuild-easyblocks#3803 |
|
2025b uses GCC 14, which introduces new warnings; see pytorch/pytorch#166873. Patch added. This seems to affect only ARM. |
|
Oh, it is a C file. Updated the patch to also add it to the C flags. |
|
Looks like I need to set those values earlier. Can you try again? |
|
Actual failure was an internal GCC compiler error: Test report by @Thyre |
|
Failure may be caused by this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121027 There was a PR which should have worked around this, but seemingly the fix doesn't work?
Maybe we need to patch |
That is not included in this (or any) release yet. I'll add it to the patch list
Would be an option, not sure if it is worth it: This EC is included since EB 5.1.0, although we did that in the past |
|
Worth noting here that I've been trying this on GH200 nodes, and the testing was taking over a week when the node died for other reasons. |
|
Test report by @Flamefire |
|
Test report by @Flamefire |
Force-pushed from f32c592 to c30732f
There were 17 test failures.
Fixed by PyTorch-2.9.1_fix-DDPCommHookType-python-3.13.patch
Issue with new Python version ("cannot pickle code object") which I'd just let fail
Happens on nodes with > 4 GPUs: pytorch/pytorch#162746
pybind11 ABI changes while test checks for exact strings
Known failure but PyTorch-2.1.0_increase-tolerance-functorch-test_vmapvjpvjp.patch can't be easily updated
IMPORTANT:
I tested this many times interactively and could neither reproduce the segfaults nor find any hint of a cause, so I just skipped that. So 9 of the 17 failures and the uncounted suite are fixed; the remaining 8 should be OK to keep. |
Fixed by using patch from upstream now |
|
Test report by @boegel |
|
After the benchmarks it seems like we should include ACL and MKL in our builds. Shall we merge this first and make new PRs for adding them or add them here (and in other open PyTorch PRs) before merging? |
I'd say we can do that in a separate PR, to make the change/fix stand out? |
|
@Flamefire please sync with |
Also, it's not clear to me yet if it's only Just for context: |
It was a change to ECs that are not merged yet, hence the question. But of course we can do that in separate PRs once those are merged, which keeps the changeset smaller.
https://github.com/intel/mkl-dnn forwards to https://github.com/uxlfoundation/oneDNN |
|
Test report by @boegel |
|
@boegelbot please test @ jsc-zen3-a100 |
|
@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)... Details- notification for comment with ID 4208724136 processed Message to humans: this is just bookkeeping information for me, |
|
Test report by @boegelbot |
(created using eb --new-pr)